Manipulation Skill

The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey

Li, Gaofeng, Wang, Ruize, Xu, Peisen, Ye, Qi, Chen, Jiming

arXiv.org Artificial Intelligence

Achieving human-like dexterous robotic manipulation remains a central goal and a pivotal challenge in robotics. The development of Artificial Intelligence (AI) has enabled rapid progress in robotic manipulation. This survey summarizes the evolution of robotic manipulation from mechanical programming to embodied intelligence, alongside the transition from simple grippers to multi-fingered dexterous hands, outlining key characteristics and main challenges. Focusing on the current stage of embodied dexterous manipulation, we highlight recent advances in two critical areas: dexterous manipulation data collection (via simulation, human demonstrations, and teleoperation) and skill-learning frameworks (imitation and reinforcement learning). Finally, based on this overview of existing data collection paradigms and learning frameworks, we summarize and discuss three key challenges restricting the development of dexterous robotic manipulation.


Manipulate as Human: Learning Task-oriented Manipulation Skills by Adversarial Motion Priors

Ma, Ziqi, Tian, Changda, Gao, Yue

arXiv.org Artificial Intelligence

In recent years, there has been growing interest in developing robots and autonomous systems that can interact with humans in a more natural and intuitive way. One of the key challenges in achieving this goal is to enable these systems to manipulate objects and tools in a manner similar to that of humans. In this paper, we propose a novel approach for learning human-style manipulation skills using adversarial motion priors, which we name HMAMP. The approach leverages adversarial networks to model the complex dynamics of tool and object manipulation, as well as the aim of the manipulation task. The discriminator is trained on a combination of real-world data and simulation data executed by the agent, and is used to train a policy that generates realistic motion trajectories matching the statistical properties of human motion. We evaluated HMAMP on a challenging manipulation task, hammering, and the results indicate that HMAMP learns human-style manipulation skills that outperform current baseline methods. Additionally, we demonstrate that HMAMP has potential for real-world applications by performing hammering tasks on a real robot arm. Overall, HMAMP represents a significant step towards developing robots and autonomous systems that interact with humans in a more natural and intuitive way, by learning to manipulate tools and objects in a manner similar to how humans do.
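The central mechanism here, a discriminator that scores whether a state transition looks human, can be sketched with a toy logistic model. Everything below (the linear parameterization, the reward form r = -log(1 - D(s, s'))) is an illustrative stand-in in the spirit of adversarial motion priors, not HMAMP's actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TransitionDiscriminator:
    """Toy logistic discriminator over (state, next_state) pairs.

    Scores how 'human-like' a transition looks; the linear
    parameterization is illustrative, not from the paper.
    """
    def __init__(self, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=2 * state_dim)
        self.b = 0.0

    def score(self, s, s_next):
        x = np.concatenate([s, s_next])
        return sigmoid(self.w @ x + self.b)

def style_reward(disc, s, s_next, eps=1e-6):
    """AMP-style reward: high when the discriminator believes the
    transition came from the human (real) motion distribution."""
    d = disc.score(s, s_next)
    return -np.log(np.clip(1.0 - d, eps, 1.0))

disc = TransitionDiscriminator(state_dim=3)
r = style_reward(disc, np.zeros(3), np.ones(3))
```

In a full training loop, the discriminator would be updated to separate human clips from policy rollouts while the policy maximizes this style reward alongside the task objective.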


Actron3D: Learning Actionable Neural Functions from Videos for Transferable Robotic Manipulation

Zhang, Anran, Chen, Hanzhi, Burkhardt, Yannick, Zhong, Yao, Betz, Johannes, Oleynikova, Helen, Leutenegger, Stefan

arXiv.org Artificial Intelligence

We present Actron3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from just a few monocular, uncalibrated, RGB-only human videos. At its core lies the Neural Affordance Function, a compact object-centric representation that distills actionable cues from diverse uncalibrated videos (geometry, visual appearance, and affordance) into a lightweight neural network, forming a memory bank of manipulation skills. During deployment, we adopt a pipeline that retrieves relevant affordance functions and transfers precise 6-DoF manipulation policies via coarse-to-fine optimization, enabled by continuous queries to the multimodal features encoded in the neural functions. Experiments in both simulation and the real world demonstrate that Actron3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in average success rate across 13 tasks while requiring only 2-3 demonstration videos per task.


Prompt-to-Product: Generative Assembly via Bimanual Manipulation

Liu, Ruixuan, Huang, Philip, Pun, Ava, Deng, Kangle, Aggarwal, Shobhit, Tang, Kevin, Liu, Michelle, Ramanan, Deva, Zhu, Jun-Yan, Li, Jiaoyang, Liu, Changliu

arXiv.org Artificial Intelligence

Creating assembly products demands significant manual effort and expert knowledge in 1) designing the assembly and 2) constructing the product. This paper introduces Prompt-to-Product, an automated pipeline that generates real-world assembly products from natural language prompts. Specifically, we leverage LEGO bricks as the assembly platform and automate the process of creating brick assembly structures. Given the user design requirements, Prompt-to-Product generates physically buildable brick designs, and then leverages a bimanual robotic system to construct the real assembly products, bringing user imaginations into the real world. We conduct a comprehensive user study, and the results demonstrate that Prompt-to-Product significantly lowers the barrier and reduces manual effort in creating assembly products from imaginative ideas.


Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation

Xiong, Ziyin, Chen, Yinghan, Li, Puhao, Zhu, Yixin, Liu, Tengyu, Huang, Siyuan

arXiv.org Artificial Intelligence

Figure 1: Ag2x2 enables zero-shot acquisition of bimanual manipulation skills without relying on expert demonstrations or engineered rewards. The framework operates in two key stages: (left) learning coordination-aware visual representations directly from human manipulation videos (shown as sequential frames of cooking with the hands highlighted) while preserving critical hand-position data despite domain differences; and (right) leveraging these representations to autonomously acquire diverse bimanual manipulation skills in simulation, demonstrated by multiple Franka robot arms performing sequential steps of various tasks, including cabinet opening (top row), door manipulation (middle row), and rope handling (bottom row).

Abstract: Bimanual manipulation, fundamental to human daily activities, remains a challenging task due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook crucial agent-specific information necessary for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while maintaining agent-agnosticism. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct. This performance outperforms baseline methods and even surpasses the success rate of policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. Ziyin Xiong and Yinghan Chen contributed equally to this work.
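One common way such learned visual representations replace engineered rewards is to score progress by embedding similarity to a goal image. The sketch below uses plain cosine similarity as that score; this is an illustrative stand-in for however Ag2x2 actually derives its reward signal, which the abstract does not specify:

```python
import numpy as np

def representation_reward(phi_current, phi_goal, eps=1e-8):
    """Reward as cosine similarity between the learned embedding of the
    current observation and that of a goal observation. Using raw cosine
    similarity here is a hypothetical choice for illustration only."""
    num = float(phi_current @ phi_goal)
    den = float(np.linalg.norm(phi_current) * np.linalg.norm(phi_goal) + eps)
    return num / den

# Toy embeddings: a matching observation scores near 1, a mismatch near 0.
match = representation_reward(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
mismatch = representation_reward(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```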


BT-TL-DMPs: A Novel Robot TAMP Framework Combining Behavior Tree, Temporal Logic and Dynamical Movement Primitives

Liu, Zezhi, Wu, Shizhen, Luo, Hanqian, Qin, Deyun, Fang, Yongchun

arXiv.org Artificial Intelligence

In the field of Learning from Demonstration (LfD), enabling robots to generalize learned manipulation skills to novel scenarios for long-horizon tasks remains challenging. Specifically, it is still difficult for robots to adapt learned skills to new environments with different task and motion requirements, especially in long-horizon, multi-stage scenarios with intricate constraints. This paper proposes a novel hierarchical framework, called BT-TL-DMPs, that integrates Behavior Trees (BTs), Temporal Logic (TL), and Dynamical Movement Primitives (DMPs) to address this problem. Within this framework, Signal Temporal Logic (STL) is employed to formally specify complex, long-horizon task requirements and constraints. These STL specifications are systematically transformed into reactive, modular BTs that provide the high-level decision-making structure. An STL-constrained DMP optimization method is proposed to optimize the DMP forcing term, allowing the learned motion primitives to adapt flexibly while satisfying intricate spatiotemporal requirements and, crucially, preserving the essential dynamics learned from demonstrations. The framework is validated through simulations demonstrating generalization capabilities under various STL constraints and real-world experiments on several long-horizon robotic manipulation tasks. The results demonstrate that the proposed framework effectively bridges the symbolic-motion gap, enabling more reliable and generalizable autonomous manipulation for complex robotic tasks.
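The forcing term being optimized lives inside the standard DMP transformation system. A minimal 1-D rollout of that system is sketched below, using the classic Ijspeert formulation (spring-damper plus phase-driven forcing term); the paper's STL-constrained optimization of the forcing term itself is not reproduced here:

```python
import numpy as np

def rollout_dmp(y0, g, forcing, T=1.0, dt=0.001,
                alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-D discrete DMP: tau*v' = alpha*(beta*(g - y) - v) + f(x).

    `forcing` maps the canonical phase x in (0, 1] to the forcing term.
    Gains follow the standard critically damped choice (beta = alpha/4).
    """
    tau = T
    y, v, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(T / dt)):
        f = forcing(x)
        dv = (alpha * (beta * (g - y) - v) + f) / tau   # transformation system
        dy = v / tau
        dx = -alpha_x * x / tau                          # canonical system
        v += dv * dt
        y += dy * dt
        x += dx * dt
        traj.append(y)
    return np.array(traj)

# With zero forcing, the DMP converges smoothly from y0 to the goal g.
traj = rollout_dmp(y0=0.0, g=1.0, forcing=lambda x: 0.0)
```

Because the forcing term vanishes as the phase x decays, goal convergence is guaranteed regardless of how the forcing term is shaped, which is what makes it a safe target for constraint-driven optimization.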


NeSyPack: A Neuro-Symbolic Framework for Bimanual Logistics Packing

Li, Bowei, Yu, Peiqi, Tang, Zhenran, Zhou, Han, Sun, Yifan, Liu, Ruixuan, Liu, Changliu

arXiv.org Artificial Intelligence

This paper presents NeSyPack, a neuro-symbolic framework for bimanual logistics packing. NeSyPack combines data-driven models and symbolic reasoning to build an explainable hierarchical framework that is generalizable, data-efficient, and reliable. It decomposes a task into subtasks via hierarchical reasoning, and further into atomic skills managed by a symbolic skill graph. The graph selects skill parameters, robot configurations, and task-specific control strategies for execution. This modular design enables robustness, adaptability, and efficient reuse, outperforming end-to-end models that require large-scale retraining. Using NeSyPack, our team won the First Prize in the What Bimanuals Can Do (WBCD) competition at the 2025 IEEE International Conference on Robotics & Automation (ICRA). Logistics packing is a crucial task in the warehouse industry, which requires personnel to select the appropriate items and pack them into a shipping box.
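The task-to-subtask-to-atomic-skill decomposition can be pictured as a small symbolic lookup. All skill names and parameters below are hypothetical placeholders, not NeSyPack's actual graph; the sketch only shows the shape of the idea, where each atomic skill carries its own execution parameters:

```python
# Toy neuro-symbolic decomposition: a task expands into subtasks, and each
# atomic skill carries execution parameters drawn from a symbolic graph.
# All names and parameter values are illustrative, not from the paper.
SUBTASKS = {
    "pack_item": ["locate_item", "grasp_item", "insert_into_box"],
}

ATOMIC_PARAMS = {
    "locate_item": {"sensor": "overhead_camera"},
    "grasp_item": {"gripper": "parallel_jaw", "approach": "top_down"},
    "insert_into_box": {"strategy": "compliant_push", "force_limit_n": 15},
}

def plan(task):
    """Expand a task into an ordered list of (skill, parameters) pairs."""
    steps = []
    for sub in SUBTASKS.get(task, [task]):
        if sub in SUBTASKS:           # nested subtask: recurse
            steps.extend(plan(sub))
        else:                         # atomic skill: attach parameters
            steps.append((sub, ATOMIC_PARAMS.get(sub, {})))
    return steps

steps = plan("pack_item")
```

Because skills and their parameters live in an explicit graph rather than network weights, individual skills can be swapped or retuned without retraining the whole system, which is the reuse advantage the abstract claims.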


Train Robots in a JIF: Joint Inverse and Forward Dynamics with Human and Robot Demonstrations

Khandate, Gagan, Wang, Boxuan, Park, Sarah, Ni, Weizhe, Palacious, Jaoquin, Lampo, Kate, Wu, Philippe, Ho, Rosh, Chang, Eric, Ciocarlie, Matei

arXiv.org Artificial Intelligence

Pre-training on large datasets of robot demonstrations is a powerful technique for learning diverse manipulation skills but is often limited by the high cost and complexity of collecting robot-centric data, especially for tasks requiring tactile feedback. This work addresses these challenges by introducing a novel method for pre-training with multi-modal human demonstrations. Our approach jointly learns inverse and forward dynamics to extract latent state representations, moving towards manipulation-specific representations. This enables efficient fine-tuning with only a small number of robot demonstrations, significantly improving data efficiency. Furthermore, our method allows for the use of multi-modal data, such as a combination of vision and touch for manipulation. By leveraging latent dynamics modeling and tactile sensing, this approach paves the way for scalable robot manipulation learning based on human demonstrations.
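The joint objective pairs a forward model (predict the next latent from latent and action) with an inverse model (predict the action from consecutive latents), both on top of a shared encoder. The sketch below uses linear maps as stand-ins for the paper's networks; the shapes of the two losses are the point, not the parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs, W_enc):
    """Toy encoder: linear map from raw observation to latent state."""
    return W_enc @ obs

def jif_losses(obs_t, obs_t1, action, W_enc, W_fwd, W_inv):
    """Joint inverse/forward dynamics losses on latent states.

    forward:  predict z_{t+1} from (z_t, a_t)
    inverse:  predict a_t from (z_t, z_{t+1})
    Linear models are illustrative placeholders for learned networks.
    """
    z_t, z_t1 = encode(obs_t, W_enc), encode(obs_t1, W_enc)
    z_pred = W_fwd @ np.concatenate([z_t, action])
    a_pred = W_inv @ np.concatenate([z_t, z_t1])
    fwd_loss = np.mean((z_pred - z_t1) ** 2)
    inv_loss = np.mean((a_pred - action) ** 2)
    return fwd_loss + inv_loss

obs_dim, latent_dim, act_dim = 8, 4, 2
W_enc = rng.normal(size=(latent_dim, obs_dim))
W_fwd = rng.normal(size=(latent_dim, latent_dim + act_dim))
W_inv = rng.normal(size=(act_dim, 2 * latent_dim))
loss = jif_losses(rng.normal(size=obs_dim), rng.normal(size=obs_dim),
                  rng.normal(size=act_dim), W_enc, W_fwd, W_inv)
```

Minimizing both terms forces the encoder to keep exactly the observation features that matter for dynamics, which is why the resulting latents transfer with few robot demonstrations.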


iManip: Skill-Incremental Learning for Robotic Manipulation

Zheng, Zexin, Cai, Jia-Feng, Wu, Xiao-Ming, Wei, Yi-Lin, Tang, Yu-Ming, Zheng, Wei-Shi

arXiv.org Artificial Intelligence

The development of a generalist agent with adaptive multiple manipulation skills has been a long-standing goal in the robotics community. In this paper, we explore a crucial task, skill-incremental learning, in robotic manipulation, which is to endow robots with the ability to learn new manipulation skills based on previously learned knowledge without re-training. First, we build a skill-incremental environment based on the RLBench benchmark, and explore how traditional incremental methods perform in this setting. We find that they suffer from severe catastrophic forgetting because previous methods, designed for classification, overlook the temporality and action complexity characteristic of robotic manipulation tasks. Towards this end, we propose an incremental Manipulation framework, termed iManip, to mitigate the above issues. We first design a temporal replay strategy to maintain the integrity of old skills when learning new ones. Moreover, we propose the extendable PerceiverIO, consisting of an action prompt with extendable weights to adapt to new action primitives in new skills. Extensive experiments show that our framework performs well in skill-incremental learning. Code for the skill-incremental environment and our framework will be open-sourced.
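The replay idea, rehearsing temporally coherent old-skill data while training on a new skill, can be sketched as batch construction that mixes whole trajectories from both pools. The 25% rehearsal ratio and function names below are arbitrary illustrative choices, not iManip's actual strategy:

```python
import random

def mixed_batch(old_trajectories, new_trajectories,
                batch_size=8, old_fraction=0.25, seed=0):
    """Sample a training batch that interleaves whole old-skill trajectories
    with new-skill ones, so temporally coherent old behavior is rehearsed
    rather than isolated frames. Ratio and seed are illustrative."""
    rng = random.Random(seed)
    n_old = max(1, int(batch_size * old_fraction)) if old_trajectories else 0
    batch = rng.sample(old_trajectories, min(n_old, len(old_trajectories)))
    batch += rng.choices(new_trajectories, k=batch_size - len(batch))
    return batch

old = [("old_skill", i) for i in range(10)]
new = [("new_skill", i) for i in range(10)]
batch = mixed_batch(old, new)
```

Replaying full trajectories rather than shuffled transitions is what preserves the temporal structure that the abstract identifies as the failure point of classification-style incremental methods.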


KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands

Yoo, Uksang, Francis, Jonathan, Oh, Jean, Ichnowski, Jeffrey

arXiv.org Artificial Intelligence

Underactuated soft robot hands offer inherent safety and adaptability advantages over rigid systems, but developing dexterous manipulation skills remains challenging. While imitation learning shows promise for complex manipulation tasks, traditional approaches struggle with soft systems due to demonstration collection challenges and ineffective state representations. We present KineSoft, a framework enabling direct kinesthetic teaching of soft robotic hands by leveraging their natural compliance as a skill teaching advantage rather than only as a control challenge. KineSoft makes two key contributions: (1) an internal strain sensing array providing occlusion-free proprioceptive shape estimation, and (2) a shape-based imitation learning framework that uses proprioceptive feedback with a low-level shape-conditioned controller to ground diffusion-based policies. This enables human demonstrators to physically guide the robot while the system learns to associate proprioceptive patterns with successful manipulation strategies. We validate KineSoft through physical experiments, demonstrating superior shape estimation accuracy compared to baseline methods, precise shape-trajectory tracking, and higher task success rates compared to baseline imitation learning approaches.